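Background: Although convolutional neural networks (CNNs) achieve high diagnostic accuracy in detecting Alzheimer's disease (AD) dementia from magnetic resonance imaging (MRI) scans, they have not yet been adopted in clinical routine. One important reason is the lack of model comprehensibility. Recently developed visualization methods for deriving CNN relevance maps may help to close this gap. We investigated whether models with higher accuracy also rely more on discriminative brain regions predefined by prior knowledge. Methods: We trained a CNN to detect AD in N = 663 T1-weighted MRI scans of patients with dementia and amnestic mild cognitive impairment (MCI), and we validated the models' accuracy via cross-validation and in three independent samples comprising N = 1655 cases. We evaluated the association between relevance scores and hippocampal volume to validate the clinical utility of this approach. To improve model comprehensibility, we implemented an interactive visualization of 3D CNN relevance maps. Results: Across the three independent datasets, group separation showed high accuracy for AD dementia versus controls (AUC $\geq$ 0.92) and moderate accuracy for MCI versus controls (AUC $\approx$ 0.75). The relevance maps indicated that hippocampal atrophy was considered the most informative factor for AD detection, with additional contributions from atrophy in other cortical and subcortical regions. Relevance scores within the hippocampus were highly correlated with hippocampal volume (Pearson's r $\approx$ -0.86, p < 0.001). Conclusion: The relevance maps highlighted atrophy in the regions we had hypothesized a priori. This strengthens the comprehensibility of CNN models trained in a purely data-driven manner on scans and diagnosis labels.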
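The validation step correlates two per-subject quantities: the relevance score inside the hippocampus and hippocampal volume. Below is a minimal pure-Python sketch of Pearson's r for that check. It assumes the per-subject relevance scores and volumes have already been extracted from the relevance maps and segmentations. The function name and the toy numbers are hypothetical and are not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root sums of squares; the 1/n factors cancel in the final ratio.
    ss_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    ss_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (ss_x * ss_y)

# Hypothetical toy values: higher hippocampal relevance paired with smaller
# hippocampal volume (more atrophy), so r comes out strongly negative.
relevance = [0.9, 0.7, 0.5, 0.3, 0.1]
volume_ml = [2.6, 2.9, 3.1, 3.4, 3.7]
r = pearson_r(relevance, volume_ml)
print(f"Pearson r = {r:.3f}")
```

A negative r is the expected direction here: the model attends most to the hippocampus when that structure is smallest.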
An important class of techniques for resonant anomaly detection in high energy physics builds models that can distinguish between reference and target datasets, where only the latter has appreciable signal. Such techniques, including Classification Without Labels (CWoLa) and Simulation Assisted Likelihood-free Anomaly Detection (SALAD), rely on a single reference dataset. They cannot take advantage of the multiple datasets that are commonly available, and thus cannot fully exploit the available information. In this work, we propose generalizations of CWoLa and SALAD for settings where multiple reference datasets are available, building on weak supervision techniques. We demonstrate improved performance in a number of settings with realistic and synthetic data. As an added benefit, our generalizations enable us to provide finite-sample guarantees, improving on existing asymptotic analyses.
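CWoLa rests on one idea: a classifier trained only to tell the target sample from the reference sample, using dataset membership as the label, also ranks true signal above background, because the target contains extra signal. Below is a toy pure-Python sketch of that single-reference idea, which the abstract generalizes. The Gaussian "signal" and "background" distributions, the 20% signal fraction, and all names are hypothetical. The paper's multi-reference, weak-supervision method is not shown.

```python
import math
import random

def draw_events(n, signal_frac, rng):
    """Toy 1-D events: signal ~ N(2, 0.5), background ~ N(0, 1)."""
    return [rng.gauss(2.0, 0.5) if rng.random() < signal_frac else rng.gauss(0.0, 1.0)
            for _ in range(n)]

def train_logistic(xs, ys, lr=0.1, epochs=200):
    """Full-batch gradient descent on the logistic loss for p(y=1 | x)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def auc(pos_scores, neg_scores):
    """P(random positive outscores random negative); ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else (0.5 if p == q else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(0)
reference = draw_events(2000, 0.0, rng)  # hypothetical background-only reference
target = draw_events(2000, 0.2, rng)     # hypothetical target with 20% signal
# Labels mark dataset membership only; no event-level truth is used in training.
w, b = train_logistic(reference + target, [0] * len(reference) + [1] * len(target))

# Check against truth labels that training never saw.
sig = [rng.gauss(2.0, 0.5) for _ in range(500)]
bkg = [rng.gauss(0.0, 1.0) for _ in range(500)]
truth_auc = auc([w * x + b for x in sig], [w * x + b for x in bkg])
print(f"w = {w:.3f}, AUC on true signal vs background = {truth_auc:.3f}")
```

In a real search, the reference sample would come from sidebands or simulation, and each event would carry a vector of observables rather than one number.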
Following the advent of immersive technologies and the increasing interest in representing interactive geometric formats, 3D point clouds (PCs) have emerged as a promising and effective means of displaying 3D visual information. Among the other challenges in immersive applications, objective and subjective quality assessment of compressed 3D content remains an open problem and an active area of research. Yet most efforts in this area ignore the local geometric structure among points. In this paper, we overcome this limitation by introducing a novel and efficient objective metric for point cloud quality assessment that learns local intrinsic dependencies using a graph neural network (GNN). We evaluated the method on two well-known datasets. The results demonstrate the effectiveness and reliability of our solution compared with state-of-the-art metrics.
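GNN-based point cloud metrics typically start from a k-nearest-neighbor graph and aggregate information over each point's neighborhood. Below is a minimal pure-Python sketch of that first stage: one mean-aggregation step over relative neighbor positions, followed by a hypothetical full-reference score. It is not the learned metric from the paper, since it has no trainable weights. It also assumes the reference and distorted clouds list the same points in the same order, which real codecs do not guarantee. All names are illustrative.

```python
import math
import random

def knn_graph(points, k):
    """For each point, the indices of its k nearest other points (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def local_features(points, k=3):
    """One mean-aggregation step over the k-NN graph.

    Each point gets the mean offset to its neighbors plus the mean neighbor
    distance, a crude descriptor of local geometric structure.
    """
    feats = []
    for i, nbrs in enumerate(knn_graph(points, k)):
        p = points[i]
        offsets = [[qc - pc for qc, pc in zip(points[j], p)] for j in nbrs]
        mean_offset = [sum(col) / len(offsets) for col in zip(*offsets)]
        mean_dist = sum(math.dist(points[j], p) for j in nbrs) / len(nbrs)
        feats.append(mean_offset + [mean_dist])
    return feats

def feature_distance(reference, distorted, k=3):
    """Hypothetical full-reference score (lower is better): mean absolute
    difference of local features, assuming point-to-point correspondence."""
    fr = local_features(reference, k)
    fd = local_features(distorted, k)
    diffs = [abs(a - b) for ra, rb in zip(fr, fd) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
ref = [(rng.random(), rng.random(), rng.random()) for _ in range(50)]
noisy = [tuple(c + rng.gauss(0, 0.02) for c in p) for p in ref]
print(feature_distance(ref, ref), feature_distance(ref, noisy))
```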
Despite the recent success of multi-task learning and pre-finetuning for natural language understanding, few works have studied the effects of task families on abstractive text summarization. Task families are a form of task grouping during the pre-finetuning stage for learning common skills, such as reading comprehension. To close this gap, we analyze the influence of multi-task learning strategies that use task families for English abstractive text summarization. We group tasks into one of three strategies, i.e., sequential, simultaneous, and continual multi-task learning, and evaluate the trained models on two downstream tasks. We find that certain combinations of task families (e.g., advanced reading comprehension and natural language inference) positively impact downstream performance. Further, we find that the choice and combination of task families influence downstream performance more than the training scheme does, supporting the use of task families for abstractive text summarization.
The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism, and the detection of such plagiarism, are still being explored in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software. We also conduct a human study with 105 participants on their detection performance and on the quality of the generated examples. Our results suggest that large models can rewrite text in a way that humans have difficulty identifying as machine-paraphrased (53% mean accuracy). Human experts rate the quality of paraphrases generated by GPT-3 as highly as that of the original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.
Deep generative models parametrized up to a normalizing constant (e.g., energy-based models) are difficult to train by maximizing the likelihood of the data, because the likelihood and/or its gradients cannot be written down explicitly or evaluated efficiently. Score matching is a training method in which, instead of fitting the likelihood $\log p(x)$ of the training data, we fit the score function $\nabla_x \log p(x)$, obviating the need to evaluate the partition function. Although this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood, which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper. We show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated, i.e., its Poincaré, log-Sobolev, and isoperimetric constants, quantities that govern the mixing time of Markov processes such as Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant, score matching will be substantially less efficient than maximum likelihood, even for simple families of distributions such as exponential families with rich enough sufficient statistics. We formalize these results in both the finite-sample and the asymptotic regimes. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
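To make "fit the score instead of the likelihood" concrete, consider Hyvärinen's objective. In one dimension it is $J(\theta) = \mathbb{E}[\tfrac{1}{2} s_\theta(x)^2 + \partial_x s_\theta(x)]$, where $s_\theta(x) = \partial_x \log p_\theta(x)$, and it involves no partition function. Below is a minimal pure-Python sketch for a 1-D Gaussian model $N(\mu, 1/\tau)$, chosen only because its minimizer has a closed form. In this family, score matching and maximum likelihood give the same estimates, so the sketch illustrates the objective, not the efficiency gap the paper analyzes. The names and the synthetic data are hypothetical.

```python
import random

def score_matching_objective(xs, mu, tau):
    """Empirical Hyvarinen score-matching objective for the model N(mu, 1/tau).

    The model score is s(x) = d/dx log p(x) = -tau * (x - mu) and its derivative
    is -tau, so J = mean(0.5 * s(x)**2 + s'(x)) never touches the normalizer.
    """
    return sum(0.5 * (tau * (x - mu)) ** 2 - tau for x in xs) / len(xs)

def fit_gaussian_score_matching(xs):
    """Closed-form minimizer of the objective: mu = sample mean, tau = 1 / variance.

    Setting dJ/dmu = 0 and dJ/dtau = tau * mean((x - mu)**2) - 1 = 0 gives these
    values, which coincide with the maximum-likelihood estimates for a Gaussian.
    """
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, 1.0 / var

rng = random.Random(1)
data = [rng.gauss(3.0, 2.0) for _ in range(5000)]
mu_hat, tau_hat = fit_gaussian_score_matching(data)
print(f"mu_hat={mu_hat:.3f}, sigma2_hat={1.0 / tau_hat:.3f}")  # near 3 and 4
```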
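Deploying AI-powered systems requires trustworthy models that support effective human interaction, going beyond raw prediction accuracy. Concept bottleneck models promote trustworthiness by conditioning the classification task on an intermediate level of human-like concepts. This enables human interventions that correct mispredicted concepts to improve the model's performance. However, existing concept bottleneck models are unable to find optimal compromises between high task accuracy, robust concept-based explanations, and effective interventions on concepts. This is particularly true in real-world conditions where complete and accurate concept supervision is scarce. To address this, we propose Concept Embedding Models, a novel family of concept bottleneck models that goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable high-dimensional concept representations. Our experiments demonstrate that Concept Embedding Models (1) attain better or competitive task accuracy w.r.t. standard neural models without concepts, (2) provide concept representations that capture meaningful semantics, including and beyond their ground-truth labels, (3) support test-time concept interventions whose effect on test accuracy surpasses that in standard concept bottleneck models, and (4) scale to real-world conditions where complete concept supervision is scarce.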
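In a concept bottleneck, an intervention replaces a predicted concept with the value a human supplies. Below is a minimal sketch of how this can work with concept embeddings. It assumes that each concept has a "present" and an "absent" embedding, blended by the predicted concept probability. This is our reading of Concept Embedding Models and is not stated in the abstract. In a real model, the network produces these embeddings for each input; here they are fixed toy vectors. All names are hypothetical.

```python
def mix_concept(p, emb_present, emb_absent):
    """Blend a concept's 'present' and 'absent' embeddings by probability p."""
    return [p * a + (1.0 - p) * b for a, b in zip(emb_present, emb_absent)]

def bottleneck(probs, present, absent, interventions=None):
    """Concatenate one mixed embedding per concept into the bottleneck vector.

    interventions: optional {concept_index: true_label (0 or 1)} supplied by a
    human at test time; it overrides the predicted probability of that concept.
    """
    interventions = interventions or {}
    out = []
    for i, (p, e_pres, e_abs) in enumerate(zip(probs, present, absent)):
        out.extend(mix_concept(float(interventions.get(i, p)), e_pres, e_abs))
    return out

# Hypothetical: two concepts with 2-D embeddings; concept 1 is truly present
# but predicted with probability 0.2, so a human corrects it to 1.
probs = [0.9, 0.2]
present = [[1.0, 0.0], [0.0, 1.0]]
absent = [[-1.0, 0.0], [0.0, -1.0]]
before = bottleneck(probs, present, absent)
after = bottleneck(probs, present, absent, interventions={1: 1})
```

A downstream task head (not shown) would then read `after` instead of `before`. The intervention moves concept 1's slice of the bottleneck fully onto its "present" embedding.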
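Adaptive multi-agent systems (AMAS) transform machine learning problems into local cooperation problems between agents. We present Smapy, an ensemble-based AMAS implementation for mobility prediction whose agents are equipped with machine learning models in addition to their cooperation rules. Through a detailed methodology, we show that linear models can perform non-linear classification on a benchmark transport-mode detection dataset if they are integrated into a cooperative multi-agent structure. The results show a significant improvement in the performance of linear models in non-linear settings thanks to the multi-agent approach.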
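XOR is the standard example that no single linear model can separate. Below is a toy pure-Python sketch of how several linear agents can jointly produce a non-linear decision boundary. It assumes a hypothetical cooperation rule, where the agent whose region center is nearest to the input answers, and hypothetical class and function names. The abstract does not describe Smapy's actual agents and cooperation rules, and they are not reproduced here.

```python
import math
import random

def train_logistic_2d(points, labels, lr=0.5, epochs=300):
    """Fit p(y=1 | x) = sigmoid(w . x + b) by full-batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(points)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for (x1, x2), y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
            gw[0] += (p - y) * x1
            gw[1] += (p - y) * x2
            gb += p - y
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

class LinearAgent:
    """Hypothetical agent: owns a region (its center) and a local linear model."""

    def __init__(self, center):
        self.center = center
        self.w, self.b = [0.0, 0.0], 0.0

    def distance(self, point):
        return math.dist(self.center, point)

    def predict(self, point):
        z = self.w[0] * point[0] + self.w[1] * point[1] + self.b
        return 1 if z > 0 else 0

def nearest_agent(agents, point):
    # Hypothetical cooperation rule: the closest agent takes responsibility.
    return min(agents, key=lambda a: a.distance(point))

def xor_label(p):
    return 1 if p[0] * p[1] > 0 else 0

rng = random.Random(0)
train = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(400)]
agents = [LinearAgent(c) for c in [(0.5, 0.5), (-0.5, 0.5), (-0.5, -0.5), (0.5, -0.5)]]
for agent in agents:
    mine = [p for p in train if nearest_agent(agents, p) is agent]
    if mine:  # each agent trains only on the inputs it is responsible for
        agent.w, agent.b = train_logistic_2d(mine, [xor_label(p) for p in mine])

test = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
acc = sum(nearest_agent(agents, p).predict(p) == xor_label(p) for p in test) / len(test)
print(f"multi-agent accuracy on XOR: {acc:.3f}")
```

The resulting decision function is piecewise linear. Each agent stays linear, and the non-linearity comes entirely from how responsibility is divided among agents.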
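Weak supervision (WS) is a powerful method for building labeled datasets to train supervised models when little to no labeled data is available. It replaces hand-labeled data with an aggregation of multiple noisy but cheap label estimates expressed by labeling functions (LFs). Although it has been used successfully in many domains, the application scope of weak supervision is limited by the difficulty of constructing labeling functions for domains with complex or high-dimensional features. To address this, a handful of methods have proposed automating the LF design process using a small set of ground-truth labels. In this work, we introduce AutoWS-Bench-101: a framework for evaluating automated WS (AutoWS) techniques in challenging WS settings, namely a set of diverse application domains to which it has previously been difficult or impossible to apply traditional WS techniques. While AutoWS is a promising direction toward expanding the application scope of WS, the emergence of powerful methods such as zero-shot foundation models reveals the need to understand how AutoWS techniques compare or cooperate with modern zero-shot or few-shot learners. This informs the central question of AutoWS-Bench-101: given an initial set of 100 labels for each task, should a practitioner use an AutoWS method to generate additional labels, or use a simpler baseline, such as zero-shot predictions from a foundation model or supervised learning? We observe that in many settings, AutoWS methods must incorporate signal from foundation models if they are to outperform simple few-shot baselines, and AutoWS-Bench-101 promotes future research in this direction. We conclude with a thorough ablation study of AutoWS methods.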
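Labeling functions are small heuristics that either vote for a class or abstain. Their votes are then aggregated into training labels. Below is a minimal hand-written example with majority-vote aggregation, the simplest aggregation baseline. Weak-supervision frameworks typically fit a label model that weights LFs by their estimated accuracy instead. The sentiment task, keyword rules, and names are hypothetical. AutoWS methods would generate such LFs automatically rather than by hand.

```python
from collections import Counter

ABSTAIN = -1
NEG, POS = 0, 1

# Hypothetical keyword labeling functions for a toy sentiment task.
def lf_great(text):
    return POS if "great" in text.lower() else ABSTAIN

def lf_terrible(text):
    return NEG if "terrible" in text.lower() else ABSTAIN

def lf_refund(text):
    return NEG if "refund" in text.lower() else ABSTAIN

def majority_vote(texts, lfs):
    """Aggregate LF votes per example; abstain when no LF fires or classes tie."""
    labels = []
    for text in texts:
        votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
        if not votes:
            labels.append(ABSTAIN)
            continue
        ranked = Counter(votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            labels.append(ABSTAIN)  # tie between classes
        else:
            labels.append(ranked[0][0])
    return labels

texts = [
    "Great product",
    "Terrible, I want a refund",
    "Great, but I asked for a refund",
    "Arrived on Tuesday",
]
weak_labels = majority_vote(texts, [lf_great, lf_terrible, lf_refund])
# weak_labels == [1, 0, -1, -1]
```

Examples left as ABSTAIN are often dropped before the downstream model is trained.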
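We study how to value a data owner's (seller's) data for a data seeker (buyer). Data valuation is often carried out for a specific task, assuming a particular utility metric (e.g., test accuracy on a validation set) that may not exist in practice. In this work, we focus on task-agnostic data valuation without any validation requirement. The data buyer has access to a limited amount of data (which may be publicly available) and seeks more data samples from a data seller. We formulate the problem as estimating the differences between the statistical properties of the seller's data and those of the baseline data available to the buyer. We capture these statistical differences by measuring the diversity and relevance of the seller's data for the buyer, and we estimate these measures through queries to the seller without requesting raw data. We design the queries in the proposed approach so that the seller is blind to the buyer's raw data. The seller also has no knowledge with which to fabricate responses that would produce a desired outcome of the diversity-relevance trade-off. Through extensive experiments on real tabular and image datasets, we show that the proposed estimates capture the diversity and relevance of the seller's data for the buyer.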
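Below is a toy sketch of estimating diversity and relevance from aggregate statistics alone, under hypothetical definitions. The buyer sends a few unit directions, the seller returns only the variance of its data along each direction, and the buyer compares those variances with its own. Seller variance that falls within the buyer's variance along the same direction counts as relevance, and the excess counts as diversity, so the two sum to one. These definitions and names are illustrative assumptions, not the paper's estimators. The sketch also does not implement the paper's query design that keeps the seller blind to the buyer's data.

```python
def projected_variance(data, direction):
    """Variance of the data projected onto one direction."""
    proj = [sum(x * d for x, d in zip(row, direction)) for row in data]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)

def seller_answer(seller_data, directions):
    """Seller side: only aggregate statistics leave the seller, never raw rows."""
    return [projected_variance(seller_data, d) for d in directions]

def diversity_relevance(buyer_data, directions, seller_vars):
    """Buyer side (hypothetical scores): split the seller's variance into the part
    inside the buyer's variance envelope (relevance) and the excess (diversity)."""
    buyer_vars = [projected_variance(buyer_data, d) for d in directions]
    total = sum(seller_vars)
    if total == 0:
        raise ValueError("seller data has no variance along the queried directions")
    relevance = sum(min(s, b) for s, b in zip(seller_vars, buyer_vars)) / total
    diversity = sum(max(s - b, 0.0) for s, b in zip(seller_vars, buyer_vars)) / total
    return diversity, relevance

directions = [(1.0, 0.0), (0.0, 1.0)]
buyer = [(-1.0, 0.0), (1.0, 0.0), (-2.0, 0.0), (2.0, 0.0)]  # spread along x only
seller_similar = [(-1.0, 0.0), (1.0, 0.0)]                  # also along x
seller_new = [(0.0, -1.0), (0.0, 1.0)]                      # along y, unseen by buyer
print(diversity_relevance(buyer, directions, seller_answer(seller_similar, directions)))
print(diversity_relevance(buyer, directions, seller_answer(seller_new, directions)))
```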